
Correlated Errors in Large Language Models

Kim, Elliot Myunghoon; Garg, Avi; Peng, Kenny; Garg, Nikhil (June 2025, International Conference on Machine Learning)

Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors: on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring, with the latter reflecting theoretical predictions regarding algorithmic monoculture.
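
To make the headline statistic concrete, the sketch below shows one plausible way to compute "agreement when both models err" on a multiple-choice benchmark. This record does not give the paper's exact definition, so the function name `agreement_when_both_err`, the input format (one answer label per question), and the toy data are all assumptions for illustration. It is not the authors' implementation.

```python
from typing import Optional, Sequence


def agreement_when_both_err(
    answers_a: Sequence[str],
    answers_b: Sequence[str],
    correct: Sequence[str],
) -> Optional[float]:
    """Share of questions both models get wrong on which they pick the same wrong answer.

    Hypothetical helper: assumes each sequence holds one answer label per question.
    Returns None when there is no question that both models answer incorrectly.
    """
    if not (len(answers_a) == len(answers_b) == len(correct)):
        raise ValueError("all three sequences must have the same length")

    # Keep only the questions where both models are wrong.
    both_wrong = [
        (a, b)
        for a, b, c in zip(answers_a, answers_b, correct)
        if a != c and b != c
    ]
    if not both_wrong:
        return None

    # Count how often the two wrong answers coincide.
    same_wrong = sum(1 for a, b in both_wrong if a == b)
    return same_wrong / len(both_wrong)


# Toy data (made up for illustration, not taken from the paper).
correct = ["A", "B", "C", "D", "A"]
model_a = ["A", "C", "D", "D", "B"]
model_b = ["B", "C", "D", "A", "C"]

rate = agreement_when_both_err(model_a, model_b, correct)
print(rate)
```

In the toy data both models are wrong on three questions and choose the same wrong option on two of them, so the rate is 2/3. As a point of comparison, if two models picked independently and uniformly among the wrong options of a question with k choices, they would agree with probability 1/(k-1), which is 1/3 for four-option questions.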

Free, publicly-accessible full text available June 18, 2026
